The Zipf's law is the major regularity of statistical linguistics that served as a prototype for
rank-frequency relations and scaling laws in natural sciences. Here we show that the Zipf's law,
together with its applicability for a single text and its generalizations to high and low frequencies
including hapax legomena, can be derived from assuming that the words are drawn into the text with
random probabilities. Their apriori density relates, via the Bayesian statistics, to general features
of the mental lexicon of the author who produced the text.

PACS numbers: 89.75.Fb, 89.75.Da, 05.65.+b



The Zipf's law states that in a given text the ordered
and normalized frequencies $f_1 \geq f_2 \geq \ldots$ for the
occurrence of the word with rank $r$ behave as $f_r \propto r^{-\gamma}$ with
$\gamma \approx 1$ [1, 2]. This law applies to texts written in many
natural and artificial languages. Its almost universal
validity fascinated generations of scholars, but its message
is still not well understood: is it just a consequence of
simple statistical regularities [3, 4], or does it reflect a deeper
structure of the text [5]? Many approaches were
proposed for deriving the Zipf's law, suggesting that it can
have different origins. They are divided into two groups.

(1) Certain theories deduce the law from certain gen- 
eral premises of the language [3, 6-10]. The general prob- 
lem of derivations from this group is that explaining the 
Zipf's law for the language (and verifying it for a fre- 
quency dictionary) does not yet mean to explain the law 
for a concrete text, where the frequency of the same word 
varies widely from one text to another and is far from its 
value in a frequency dictionary [12]. 

(2) The law can be derived from certain probabilistic
models [4, 11-16]. Although some of these models are
relevant for realistic text-generating processes [14, 15],
their a priori assumed probability structure is intricate,
hence the question "why the Zipf's law?" translates into
"why a specific probabilistic model?" By far the best-known
probabilistic model is a random text, where words are
generated through random combinations of letters and
the space symbol, seemingly reproducing the $f_r \propto r^{-1}$
shape of the law [3, 4]. But the reproduction is elusive,
since the model leads to a huge redundancy (many words
have the same frequency and length) that is absent in normal
texts [17].

Our approach for deriving the Zipf's law also uses a 
probability model. It differs from previous models in sev- 
eral respects. First, it explains the law for a single text 
together with its limits of validity, i.e. together with the 
range of ranks where it holds. It also explains the rank- 
frequency relation for very rare words (hapax legomena) 



TABLE I: Parameters of 3 texts: The Age of Reason (AR) by
T. Paine, 1794 (the major source of British deism). Thoughts
on the Funding System and its Effects (TF) by P. Ravenstone,
1824 (economics). Dream Lover (DL) by J. MacIntyre, 1987
(romance novella). Total number of words $N$, number of
different words $n$, the lower $r_{\min}$ and the upper $r_{\max}$ ranks of
the Zipfian domain, the fitted values of $c$ and $\gamma$.

Texts   N       n      r_min   r_max   c       gamma
TF      26624   2067   36      371     0.168   1.032
AR      22641   1706   32      339     0.178   1.038
DL      24990   1748   34      230     0.192   1.039




and relates it to the Zipf's law. Second, the a priori struc- 
ture of our model relates to the mental lexicon [18] of the 
author who produced the text. Third, the model is not 
ad hoc: it is based on the latent semantic analysis that 
is used successfully for text modeling. 

The validity range of the Zipf's law. Below we
present empirical results exemplified on 3 English texts
[see Table I] that clarify the validity range of the law,
confirm known results, but also make new points that
motivate the theoretical model worked out in the sequel.

For each text we extract the ordered frequencies of $n$
different words:

$$\{f_r\}_{r=1}^{n}, \qquad f_1 \geq \ldots \geq f_n, \qquad \sum_{r=1}^{n} f_r = 1. \tag{1}$$

To fit $\{f_r\}_{r=1}^{n}$ to the Zipf's form $\hat f_r = c\,r^{-\gamma}$, we represent
the data as $\{y_r(x_r)\}_{r=1}^{n}$, where $y_r = \ln f_r$ and $x_r = \ln r$,
and fit it to the linear form $\{\hat y_r = \ln c - \gamma x_r\}_{r=1}^{n}$. Two
unknowns $\ln c$ and $\gamma$ are obtained from minimizing the
sum of squared errors $SS_{\rm err} = \sum_{r=1}^{n}(y_r - \hat y_r)^2$ [28]. Now
$\min_{c,\gamma}[SS_{\rm err}] = SS^*_{\rm err}$ and the correlation coefficient $R$
between $\{y_r\}_{r=1}^{n}$ and $\{\hat y_r\}_{r=1}^{n}$ [20, 28] measure the fitting
quality: $SS^*_{\rm err} \to 0$ and $R^2 \to 1$ mean good fitting. We
minimize $SS_{\rm err}$ over $c$ and $\gamma$ for $r_{\min} < r < r_{\max}$ and
find the maximal value of $r_{\max} - r_{\min}$ for which $SS^*_{\rm err}$
and $1 - R^2$ are smaller than, respectively, 0.05 and 0.005.
This value of $r_{\max} - r_{\min}$ also determines the final fitted
values of $c$ and $\gamma$; see Table I and [28].

FIG. 1: (Color online) Frequency vs. rank for the text TF;
see Table I. Red line: the Zipf curve $f_r = 0.168\,r^{-1.032}$.
Arrows indicate the validity range of the Zipf's law. Blue
line: the solution of (9, 10) for $c = 0.168$ and $n = 2067$. It
coincides with the generalized Zipf law (14) for $r > r_{\min} = 36$.
The step-wise behavior of $f_r$ for $r > r_{\max}$ refers to hapax
legomena.

1. For each text there is a specific (Zipfian) range of
ranks $r \in [r_{\min}, r_{\max}]$, where the Zipf's law holds with
$\gamma \approx 1$ and $c < 0.2$ [1, 2]; see Table I and Fig. 1.

2. Even if the same word enters into different texts 
it typically has quite different frequencies there [12], e.g. 
among 83 common words in the Zipfian ranges of AR 
and DL [see Table I], only 12 words have approximately 
equal ranks and frequencies. 

3. The pre-Zipfian range $1 \leq r < r_{\min}$ contains mainly
function words. They serve for establishing grammatical
constructions (e.g., the, a, such, this, that, where, were).
But the majority of words in the Zipfian range do have
a narrow meaning (content words). A subset of those
content words has a meaning that is specific for the text
and can serve as its keywords [21]. Below [in 15] we
explain why the keywords appear in the Zipfian domain.

4. The absolute majority of different words with
ranks in $[r_{\min}, r_{\max}]$ have different frequencies. Only
for $r \simeq r_{\max}$ the number of different words having the
same frequency is $\simeq 10$. For $r > r_{\max}$ we meet the
hapax legomena: words occurring only a few times in the text
($f_r N = 1, 2, \ldots$ is a small integer), and many words
having the same frequency $f_r$ [2]. The effect is not described
by a smooth rank-frequency relation, including the Zipf's
law.

5. The minimal frequency of the Zipfian domain satisfies
$f_{r_{\max}} > c/n$. We checked that this is valid not only for
separate texts but also for the frequency dictionaries of
English and Irish. For our texts a stronger relation holds:
$f_{r_{\max}} \gtrsim \frac{1}{n}$. Hence $f_{r_{\max}} N \gtrsim \frac{N}{n} \gg 1$; see Table I.

Introduction to the model. A model for the Zipf's 
law is supposed to satisfy the following features. 

(I) Apply to separate texts, i.e. explain how different 



texts can satisfy the same form of the rank-frequency 
relation despite the fact that the same words do not occur 
with same frequencies in the different texts; see 2. 

(II) Derive the law together with its extensions for all 
frequencies, limits of validity and hapax legomena effect. 

(III) Relate the law to formation of a text. 

Two sources of the model are the latent semantic
analysis [22] and the idea of applying ordered statistics to
rank-frequency relations [8, 24, 25].

Our model makes four (A — D) assumptions. 

A. The bag-of-words picture focusses on the frequency
of the words that occur in a text and neglects their
mutual disposition (i.e. syntactic structure) [23]. Given $n$
different words $\{w_k\}_{k=1}^{n}$, the joint probability for $w_k$ to
occur $\nu_k \geq 0$ times in a text $T$ is multinomial

$$\pi[\nu|\theta] = \frac{N!}{\nu_1! \ldots \nu_n!}\,\theta_1^{\nu_1} \ldots \theta_n^{\nu_n}, \qquad \nu = \{\nu_k\}_{k=1}^{n}, \quad \theta = \{\theta_k\}_{k=1}^{n}, \tag{2}$$

where $N = \sum_{k=1}^{n}\nu_k$ is the length of the text, $\nu_k$ is the
number of occurrences of $w_k$, and $\theta_k$ is the probability
of $w_k$. The picture is well-known in computational
linguistics [23]. But for our purposes it is incomplete, because
it implies that each word has the same probability for
different texts [recall (I)].
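The multinomial probability of assumption A can be evaluated numerically. The following is a minimal sketch, not taken from the paper: the function name `multinomial_log_prob` and the toy counts and probabilities are illustrative assumptions. It works with logarithms to avoid overflow of the factorials for realistic text lengths.

```python
import math

def multinomial_log_prob(counts, theta):
    """Log of the multinomial probability for word counts `counts`
    given word probabilities `theta` (hypothetical helper)."""
    if len(counts) != len(theta):
        raise ValueError("counts and theta must have the same length")
    total = sum(counts)                      # text length N
    log_p = math.lgamma(total + 1)           # ln N!
    for nu_k, theta_k in zip(counts, theta):
        log_p -= math.lgamma(nu_k + 1)       # minus ln nu_k!
        if nu_k > 0:
            log_p += nu_k * math.log(theta_k)
    return log_p

# Toy example (assumed values): 3 distinct words, text of length 6.
toy_counts = [3, 2, 1]
toy_theta = [0.5, 0.3, 0.2]
print(math.exp(multinomial_log_prob(toy_counts, toy_theta)))
```

For a fixed `theta` the same word always has the same expected frequency, which is the limitation that assumption B removes by making the probabilities random.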

B. To improve this point we make $\theta$ a random vector
[23] with a text-dependent density $P(\theta|T)$. The simplest
assumption is that $(T, \theta, \nu)$ form a Markov chain: the
text $T$ influences the observed $\nu$ only via $\theta$. Then the
probability $p(\nu|T)$ of $\nu$ in a given text $T$ reads

$$p(\nu|T) = \int d\theta\;\pi[\nu|\theta]\,P(\theta|T). \tag{3}$$

This form of $p(\nu|T)$ is basic for probabilistic latent
semantic analysis [22], a successful method of
computational linguistics. There the density $P(\theta|T)$ of latent
variables is determined from the data fitting. But we
shall deduce $P(\theta|T)$ theoretically.

C. $P(\theta|T)$ is generated from a density $P(\theta)$ via
conditioning on the ordering of $w = \{w_k\}_{k=1}^{n}$ in $T$:

$$P(\theta|T) = P(\theta)\,\chi_T(\theta, w) \Big/ \int d\theta'\,P(\theta')\,\chi_T(\theta', w). \tag{4}$$

If different words of $T$ are ordered as $(w_1, \ldots, w_n)$ with
respect to the decreasing frequency of their occurrence in
$T$ (i.e. $w_1$ is more frequent than $w_2$), then $\chi_T(\theta, w) = 1$
if $\theta_1 \geq \ldots \geq \theta_n$, and $\chi_T(\theta, w) = 0$ otherwise.

As substantiated below, $P(\theta)$ refers to the mental
lexicon of the author prior to generating a concrete text.

D. For simplicity, we assume that the probabilities $\theta_k$
are distributed identically and the dependence among
them is due to $\sum_{k=1}^{n}\theta_k = 1$ only:

$$P(\theta) \propto u(\theta_1) \ldots u(\theta_n)\,\delta\Big(\sum_{k=1}^{n}\theta_k - 1\Big), \tag{5}$$

where $\delta(x)$ is the delta function and the normalization
ensuring $\int_0^\infty \prod_{k=1}^{n} d\theta_k\,P(\theta) = 1$ is omitted.
Solution of the model and the Zipf's law. The
conditional probability $p_r(\nu|T)$ for the $r$'th most frequent
word $w_r$ to occur $\nu$ times in the text $T$ reads from (2, 3)

$$p_r(\nu|T) = \frac{N!}{\nu!\,(N-\nu)!}\int_0^1 d\theta\;\theta^{\nu}(1-\theta)^{N-\nu}\,P_r(\theta|T), \tag{6}$$

$$P_r(t|T) = \int d\theta\;P(\theta|T)\,\delta(t - \theta_r), \tag{7}$$

where $P_r(t|T)$ is the marginal density for the probability
$t$ of $w_r$. For $n \gg 1$, we deduce from (4, 5) that $P_r(t|T)$
follows the law of large numbers [28]. It is Gaussian,

$$P_r(t|T) \propto \exp\Big[-\frac{n^3}{2\sigma_r^2}\,(t - \phi_r)^2\Big], \tag{8}$$

where $\sigma_r$ [for $\phi_r = o(1)$] is estimated in [28], and the mean $\phi_r$ is
found from two equations for two unknowns $\mu$ and $\phi_r$:

$$\frac{r}{n} = \int_{\phi_r}^{\infty} d\theta\;u(\theta)\,e^{-\mu\theta n} \Big/ \int_0^{\infty} d\theta\;u(\theta)\,e^{-\mu\theta n}, \tag{9}$$

$$\int_0^{\infty} d\theta\;\theta\,u(\theta)\,e^{-\mu\theta n} = \frac{1}{n}\int_0^{\infty} d\theta\;u(\theta)\,e^{-\mu\theta n}. \tag{10}$$

Eq. (8) holds for $P_r(t|T)$ whenever its standard deviation
$\sigma_r n^{-3/2}$ is much smaller than the mean $\phi_r$; as checked
below, this happens already for $r > 10$.

6. The meaning of (9, 10) is explained via the marginal
density $P(\theta_1) = \int \prod_{k=2}^{n} d\theta_k\,P(\theta) \propto u(\theta_1)\,e^{-\mu n \theta_1}$ found
from (5) [28]. Eq. (10) ensures that $\int_0^\infty d\theta\,\theta\,P(\theta) = \frac{1}{n}$.
This relation follows from $\sum_{k=1}^{n}\theta_k = 1$ and it determines
$\mu$, an analogue of the chemical potential in statistical
physics [28]. The interpretation of (9) is that it equates
the relative rank $r/n$ to the (unconditional) probability
$\int_{\phi_r}^{\infty} d\theta\,P(\theta)$ of $\theta \geq \phi_r$.

Let us study implications of (6-10) for the Zipf's law. 

7. In (6), $P_r(\theta|T)$ is much more narrowly peaked than
$\theta^{\nu}(1-\theta)^{N-\nu}$, since $n^3 \gg N \gg 1$ [see Table I]. Hence
in this limit we approximate $P_r(\theta|T)$ by the delta-function
$\delta(\theta - \phi_r)$ [see (8)]:

$$p_r(\nu|T) = \frac{N!}{\nu!\,(N-\nu)!}\,\phi_r^{\nu}\,(1-\phi_r)^{N-\nu}. \tag{11}$$

Eq. (11) is the main outcome of the model; it shows
that the conditional probability $p_r(\nu|T)$ for the occurrence
number $\nu$ of the word $w_r$ has the same form (11) for
different texts (see I). In (11), $\phi_r$ is the effective
probability of the word $w_r$. If $N\phi_r \gg 1$, $p_r(\nu|T)$ is peaked at
$\nu = N\phi_r$: the frequency of a word that appears many
times equals its probability. Each word of the Zipfian
domain occurs at least $\nu \sim N/n \gg 1$ times; see 5. For
such words we approximate $f_r = \nu/N \simeq \phi_r$.
8. Now we postulate in (5)

$$u(f) = (n^{-1}c + f)^{-2}, \tag{12}$$

where $c$ is related below to the prefactor of the Zipf's law.
Eq. (12) is explained in 13-15 below.



TABLE II: Description of the hapax legomena for the text
TF; see Table I and (15). The first row lists $k$, the second the
empirical $r_k$, the third the prediction $\hat r_k$ of (15). The maximal
relative error of 0.0357 is reached for $k = 6$.

k         1      2      3     4     5     6     7     8     9     10
r_k       1446   1061   848   722   611   529   474   437   398   370
r^_k      1414   1074   866   726   624   547   488   440   400   368



9. For $c < 0.2$, $c\mu$ determined from (10, 12) is small
and is found from integration by parts:

$$\mu \simeq c^{-1}\,e^{-\gamma_E - \frac{1+c}{c}}, \tag{13}$$

where $\gamma_E = 0.57721$ is the Euler's constant. One solves
(9) for $c\mu \to 0$: $\frac{r}{n} = c\,e^{-n\phi_r\mu}/(c + n\phi_r)$. For $r > r_{\min}$,
$\phi_r n\mu = f_r n\mu < 0.04 \ll 1$; see (13) and Table I. We get

$$f_r = c\,(r^{-1} - n^{-1}). \tag{14}$$

This is the Zipf's law generalized by the factor $n^{-1}$ at
high ranks $r$. This cut-off factor ensures faster [than
$r^{-1}$] decay of $f_r$ for large $r$. In literature a cut-off factor
similar to $n^{-1}$ is introduced due to additional mechanisms
(hence new parameters); see [14]. In our situation the
power-law and cut-off come from the same mechanism.
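The smallness of $\mu$ can be checked numerically. The sketch below is our own illustration, not the authors' procedure: after rescaling $x = n\theta$, the constraint (10) with $u$ from (12) becomes $\int_0^\infty dx\,(x-1)(c+x)^{-2}e^{-\mu x} = 0$, which we rewrite (an assumption of this sketch) through the exponential integral $E_1$ and solve by bisection. All function names are hypothetical; the parameters are those of the text TF in Table I.

```python
import math

EULER_GAMMA = 0.5772156649015329

def exp_integral_e1(z, terms=60):
    """E_1(z) for small positive z via its convergent power series."""
    total = -EULER_GAMMA - math.log(z)
    term = 1.0                       # holds (-z)^k / k!
    for k in range(1, terms + 1):
        term *= -z / k
        total -= term / k
    return total

def constraint(mu, c):
    """Integral of (x - 1) (c + x)^-2 exp(-mu x) over x >= 0, in closed form."""
    g = math.exp(mu * c) * exp_integral_e1(mu * c)   # integral of exp(-mu x)/(c + x)
    return g * (1.0 + (1.0 + c) * mu) - (1.0 + c) / c

def solve_mu(c, lo=1e-12, hi=1.0, iterations=200):
    """Geometric bisection; assumes constraint(lo) > 0 > constraint(hi)."""
    for _ in range(iterations):
        mid = math.sqrt(lo * hi)
        if constraint(mid, c) > 0.0:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)

def mu_asymptotic(c):
    """Small-mu estimate mu ~ exp(-gamma_E - (1 + c)/c) / c."""
    return math.exp(-EULER_GAMMA - (1.0 + c) / c) / c

c, n, r_min = 0.168, 2067, 36        # text TF, Table I
mu = solve_mu(c)
f_r_min = c * (1.0 / r_min - 1.0 / n)
print(mu, mu_asymptotic(c), n * mu * f_r_min)
```

Under these assumptions the numerical root and the asymptotic estimate differ by a few percent, and the product $n\mu f_r$ at $r = r_{\min}$ stays below 0.04.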

Fig. 1 shows that (14) reproduces well the empirical
behavior of $f_r$ for $r > r_{\min}$. Our derivation shows that $c$
is the prefactor of the Zipf's law, and that our
assumption on $c < 0.2$ above (13) agrees with observations; see
Table I. For $c > 0.2$, (9, 10) do not predict the Zipf's law
(14).

10. For given prefactor $c$ and the number of different
words $n$, (9-12) predict the Zipfian range $[r_{\min}, r_{\max}]$ in
agreement with empirical results; see Fig. 1.

11. For $r < r_{\min}$, it is no longer true that $f_r n\mu \ll 1$.
So the fuller expression (9) is to be used. It reproduces
qualitatively the empirical behavior of $f_r$; see Fig. 1.

12. According to (11), the probability $\phi_r$ is small for
$r > r_{\max}$ and hence the occurrence number $\nu = f_r N$ of a
word $w_r$ is a small integer (e.g. 1 or 2) that cannot be
approximated by a continuous function of $r$; see (12) and
Fig. 1. To describe this hapax legomena range, define
$r_k$ as the rank when $\nu = f_r N$ jumps from integer $k$ to
$k + 1$. Since (14) reproduces well the trend of $f_r$ even for
$r > r_{\max}$ (see Fig. 1), $r_k$ can be theoretically predicted
from (14) by equating its left-hand side to $k/N$:

$$\hat r_k = \Big[\frac{k}{Nc} + \frac{1}{n}\Big]^{-1}, \qquad k = 0, 1, 2, \ldots \tag{15}$$

Eq. (15) is exact for $k = 0$, and $\hat r_k$ agrees with $r_k$ for $k \geq 1$;
see Table II. Hence it describes the hapax legomena
phenomenon (many words have the same small frequency)
[26].
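The prediction for the hapax legomena ranks can be reproduced from the numbers quoted in Tables I and II. A minimal sketch, assuming that equating (14) to $k/N$ gives $\hat r_k = [k/(Nc) + 1/n]^{-1}$; the function name `predicted_jump_rank` is hypothetical, and the data are the TF values from the tables:

```python
N_TF, n_TF, c_TF = 26624, 2067, 0.168    # text TF, Table I

def predicted_jump_rank(k, total_words, distinct_words, c):
    """Rank at which the occurrence number jumps from k to k + 1,
    obtained by setting c * (1/r - 1/n) = k / N and solving for r."""
    return 1.0 / (k / (total_words * c) + 1.0 / distinct_words)

# Empirical ranks r_k for k = 1..10, copied from Table II.
empirical = [1446, 1061, 848, 722, 611, 529, 474, 437, 398, 370]

for k, r_emp in enumerate(empirical, start=1):
    r_hat = predicted_jump_rank(k, N_TF, n_TF, c_TF)
    print(k, r_emp, round(r_hat, 1), round(abs(r_hat - r_emp) / r_emp, 4))
```

Because $c$ is quoted to three digits only, the computed values can differ from the tabulated predictions by about one rank.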

Preliminary summary. Thus 9-12 achieved the
promises (I) and (II) of our program: though different
texts can have different frequencies for the same words, the
frequencies of words in a given text follow the Zipf's law
with the correct prefactor $c < 0.2$. Without additional
fitting parameters and new mechanisms we recovered the
corrected form of this law applicable for large and small
frequencies [see 11, 12]. But why would we select (12),
if we did not know that it reproduces the Zipf's law?
Answering this question will fulfil (III).

Mental lexicon and the apriori density. Here we
explain the choice (5, 12) for the apriori probability
density for the probabilities $\theta = (\theta_1, \ldots, \theta_n)$ of different words
$(w_1, \ldots, w_n)$. To avoid the awkward term "probability for
probability" we shall call $P(\theta)$ likelihood. We focus on
the marginal likelihood [see 6 and (12)]:

$$P(\theta) \propto (n^{-1}c + \theta)^{-2}\,e^{-\mu n \theta}, \tag{16}$$

since $P(\theta)$ determines the rank-frequency relation (9).
For a more detailed discussion of the items below see
[28].

13. The basic reason for the words to have random
(variable) probabilities is that the text-producing author
should be able to compose different texts, where the same
word can have very different frequencies [see I]. Hence
$P(\theta)$ relates to the prior knowledge (or lexicon) of the
author on words. This concept of mental lexicon is an
established one in psycholinguistics [18, 19].

14. Once each word $w_k$ has to have a variable
probability $\theta_k$, there should be a way for the author to increase
it, e.g. when the author decides that $w_k$ should become
a keyword of the text. The ensuing relation between
the probability vectors $\theta'$ (new) and $\theta$ (old) should be
a group, since the author should be able to come back
from $\theta'$ to $\theta$ when revising the text. Under certain
natural conditions, the only such group with parameters $\tau_k$
is [27]:

$$\theta'_k = \frac{\tau_k\theta_k}{\sum_{l=1}^{n}\tau_l\theta_l}, \qquad \tau_k > 0, \quad k = 1, \ldots, n. \tag{17}$$

Eq. (17) is a generalized Bayes formula [27, 28]. It is
used in the Bayesian statistics for motivating the choice
of priors [27], a task related to ours.
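The group property can be illustrated with a short sketch. It assumes that the transformation multiplies each probability $\theta_k$ by its parameter $\tau_k > 0$ and renormalizes, so that its inverse is the same map with parameters $1/\tau_k$; the function name `reweight` and the toy vector are illustrative assumptions.

```python
def reweight(theta, tau):
    """Multiply each probability by its weight tau_k > 0 and renormalize
    (hypothetical helper for the generalized Bayes transformation)."""
    weighted = [t * p for t, p in zip(tau, theta)]
    norm = sum(weighted)
    return [w / norm for w in weighted]

theta_old = [0.02, 0.08, 0.90]       # toy probability vector (assumed)
tau = [3.0, 1.0, 1.0]                # boost the first word only

theta_new = reweight(theta_old, tau)
theta_back = reweight(theta_new, [1.0 / t for t in tau])

print(theta_new)    # first entry: 3 * 0.02 / (1 + 2 * 0.02)
print(theta_back)   # recovers theta_old up to rounding
```

Applying the map with the inverse parameters returns the original vector, which corresponds to the author coming back from the new probabilities to the old ones.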

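If the author wants to increase $\tau_1$ times the probability
of the word $w_1$, then in (17) $\tau_1 > 1$ and $\tau_{k \geq 2} = 1$:

$$\theta'_1 = \frac{\tau_1\theta_1}{1 + (\tau_1 - 1)\theta_1}, \qquad \theta'_l = \frac{\theta_l}{1 + (\tau_1 - 1)\theta_1} \quad \text{for } l \geq 2. \tag{18}$$

The inverse of (18) is found by interchanging $\theta'_k$ with $\theta_k$
and $\tau_1$ with $\tau_1^{-1}$. For the Zipf's law the relevant
probabilities are small, $\theta'_1 < O(1/n)$; see 9 and Fig. 1. Then
$1 + (\tau_1^{-1} - 1)\theta'_1 \simeq 1$ and (18) becomes the scaling
transformation of one variable: $\theta'_1 = \tau_1\theta_1$, $\theta'_l = \theta_l$, $l \geq 2$. The
new likelihood reads from (18, 16)

$$P'(\theta'_1) = \frac{1}{\tau_1}\,P\Big(\frac{\theta'_1}{\tau_1}\Big) = \frac{1}{\tau_1}\Big(\frac{c}{n} + \frac{\theta'_1}{\tau_1}\Big)^{-2} e^{-\mu n \theta'_1/\tau_1}. \tag{19}$$

Other densities do not change: $P'(\theta'_l) = P(\theta'_l)$ for $l \geq 2$.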

15. Once $P(\theta)$ describes the mental lexicon, and (17)
is an operation by which the text is written, we suppose
that the features of $P(\theta)$ can be explained by checking
its response to (17). For the ratio of the new to the old
likelihood of the probability $\theta'_1$ we get from (19)

$$P'(\theta'_1)/P(\theta'_1) = \tau_1 > 1 \quad \text{for } \theta'_1 \gg c\tau_1/n, \tag{20}$$
$$P'(\theta'_1)/P(\theta'_1) = \tau_1^{-1} < 1 \quad \text{for } \theta'_1 \ll c\tau_1/n. \tag{21}$$

The meaning of (20, 21) is that once the author decides
to increase the probability of the word $w_1$ by $\tau_1$ times,
this word will be $\tau_1$ times more likely produced with
the higher probabilities, and $\tau_1$ times less likely with
smaller probabilities; see (21). This is the mechanism
that ensures the appearance of the keywords in the
Zipfian range. It is unique to the form (16) of the marginal
likelihood, which by itself is due to the form (12) of $u(\theta)$.
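The two limiting values of the likelihood ratio can be checked numerically. The sketch below assumes the marginal likelihood $(c/n + \theta)^{-2}e^{-\mu n\theta}$ and the scaling transformation $\theta'_1 = \tau_1\theta_1$, so that the transformed density is $\tau_1^{-1}P(\theta'_1/\tau_1)$. The function names are hypothetical, and the exponential factor is switched off by default (`mu=0.0`), which mimics the regime $n\mu\theta \ll 1$.

```python
import math

def marginal_density(theta, c, n, mu=0.0):
    """Unnormalized marginal likelihood (c/n + theta)^-2 * exp(-mu n theta)."""
    return (c / n + theta) ** -2 * math.exp(-mu * n * theta)

def likelihood_ratio(theta_new, tau, c, n, mu=0.0):
    """Ratio of the transformed to the original density at theta_new,
    for the scaling transformation theta_new = tau * theta_old."""
    transformed = marginal_density(theta_new / tau, c, n, mu) / tau
    return transformed / marginal_density(theta_new, c, n, mu)

c, n, tau = 0.168, 2067, 2.0         # TF parameters and an assumed boost factor
scale = c * tau / n                  # crossover probability c * tau / n

print(likelihood_ratio(1000 * scale, tau, c, n))     # close to tau
print(likelihood_ratio(0.0005 * scale, tau, c, n))   # close to 1 / tau
```

The normalization of the density cancels in the ratio, so it is omitted here.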

If $P(\theta)$ is assumed to reflect the organization of the
mental lexicon, then according to (20, 21) this
organization is efficient, because the decision on increasing the
probability of $w_1$ translates to increasing the likelihood of
larger values of the probability. The organization is also
stable, since the likelihood at large probabilities increases
right at the amount the author planned, not more.

Conclusion. We answer the first question asked in the 
introduction: the Zipf's law — together with the limits of 
its validity, its generalization to high and low frequencies 
and hapax legomena — relates to the stable and efficient 
organization of the mental lexicon of the text-producing 
author. Practically, our derivation of the Zipf's law will 
motivate the usage of prior (12) in the schemes of la- 
tent semantic analysis. We expect these schemes to be 
more efficient for real texts, if the prior structure of the
model conforms to the Zipf's law. The proposed methods
can find applications for studying rank-frequency rela- 
tions and power laws in other fields. 

We thank A. Galstyan and D. Manin for discussions. 
This work is supported by the Region des Pays de la Loire 
under the Grant 2010-11967. 
Supplementary Material



In this supplementary material to the main text we re- 
view the linear fitting method, derive and clarify Eqs.(8- 
10) from the section Solution of the model and the 
Zipf's law of the main text, derive the expression for 
the marginal probability [Eq. (16) and point 6 of the 
main text] , and discuss in more detail the content of sec- 
tion Mental lexicon and the apriori density. These 
tasks are carried out in, respectively, sections I, II, III 
and IV below. 



I. LINEAR FITTING 

Here we recall the main ideas of the linear fitting method 
that is employed in the section The validity range of 
the Zipf's law of the main text. 

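Table I of the main text presents 3 texts we studied (we
worked out more texts that consistently show the same
applicability pattern of the Zipf's law). For each text we
extract the ordered frequencies of different words [the
number of different words is $n$; the overall number of
words in a text is $N$]:

$$\{f_r\}_{r=1}^{n}, \qquad f_1 \geq \ldots \geq f_n, \qquad \sum_{r=1}^{n} f_r = 1. \tag{22}$$

We should now see whether the data $\{f_r\}_{r=1}^{n}$ fits to a
power law: $\hat f_r = c\,r^{-\gamma}$. We represent the data as

$$\{y_r(x_r)\}_{r=1}^{n}, \qquad y_r = \ln f_r, \qquad x_r = \ln r, \tag{23}$$

and fit it to the linear form $\{\hat y_r = \ln c - \gamma x_r\}_{r=1}^{n}$. Two
unknowns $\ln c$ and $\gamma$ are obtained from minimizing the
sum of squared errors:

$$SS_{\rm err} = \sum_{r=1}^{n}(y_r - \hat y_r)^2. \tag{24}$$

It is known since Gauss that this minimization produces

$$-\gamma^* = \frac{\sum_{k=1}^{n}(x_k - \bar x)(y_k - \bar y)}{\sum_{k=1}^{n}(x_k - \bar x)^2}, \qquad \ln c^* = \bar y + \gamma^*\bar x, \tag{25}$$

where we defined

$$\bar y = \frac{1}{n}\sum_{k=1}^{n} y_k, \qquad \bar x = \frac{1}{n}\sum_{k=1}^{n} x_k. \tag{26}$$

As a measure of fitting quality one can take

$$\min_{c,\gamma}[SS_{\rm err}(c, \gamma)] = SS_{\rm err}(c^*, \gamma^*) \equiv SS^*_{\rm err}. \tag{27}$$

This is however not the only relevant quality measure.
Another (more global) aspect of this quality is the
coefficient of correlation between $\{y_r\}_{r=1}^{n}$ and $\{\hat y^*_r\}_{r=1}^{n}$ [29]:

$$R = \frac{\sum_{k=1}^{n}(y_k - \bar y)(\hat y^*_k - \bar{\hat y}^*)}{\sqrt{\sum_{k=1}^{n}(y_k - \bar y)^2\,\sum_{k=1}^{n}(\hat y^*_k - \bar{\hat y}^*)^2}}, \tag{28}$$

where

$$\{\hat y^*_r = \ln c^* - \gamma^* x_r\}_{r=1}^{n}, \qquad \bar{\hat y}^* = \frac{1}{n}\sum_{k=1}^{n}\hat y^*_k. \tag{29}$$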
For the linear fitting (25) the squared correlation
coefficient is equal to the coefficient of determination,

$$R^2 = \frac{\sum_{k=1}^{n}(\hat y^*_k - \bar y)^2}{\sum_{k=1}^{n}(y_k - \bar y)^2}, \tag{30}$$

the amount of variation in the data explained by the
fitting [29]. Hence $SS^*_{\rm err} \to 0$ and $R^2 \to 1$ mean good
fitting. We minimize $SS_{\rm err}$ over $c$ and $\gamma$ for $r_{\min} < r < r_{\max}$
and find the maximal value of $r_{\max} - r_{\min}$ for which
$SS^*_{\rm err}$ and $1 - R^2$ are smaller than, respectively, 0.05 and
0.005. This value of $r_{\max} - r_{\min}$ also determines the final
fitted values $c^*$ and $\gamma^*$ of $c$ and $\gamma$, respectively; see Tables
I, II and Fig. 1. Thus $c^*$ and $\gamma^*$ are found simultaneously
with the validity range $[r_{\min}, r_{\max}]$ of the law. Whenever
there is no risk of confusion, we for simplicity refer to $c^*$
and $\gamma^*$ as $c$ and $\gamma$, respectively.
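The linear fitting step can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the code used for Table I: the function name `fit_zipf` is hypothetical, the means and sums are taken over the chosen window of ranks, and the input is a synthetic exact power law instead of a real text. The search for the largest window satisfying the two thresholds is not included.

```python
import math

def fit_zipf(frequencies, r_min, r_max):
    """Least-squares fit of ln f_r = ln c - gamma * ln r over ranks r_min..r_max.

    frequencies[r - 1] holds f_r (ranks are 1-based).
    Returns (c, gamma, ss_err, r_squared).
    """
    ranks = range(r_min, r_max + 1)
    xs = [math.log(r) for r in ranks]
    ys = [math.log(frequencies[r - 1]) for r in ranks]
    m = len(xs)
    x_mean = sum(xs) / m
    y_mean = sum(ys) / m
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_yy = sum((y - y_mean) ** 2 for y in ys)
    gamma = -s_xy / s_xx                      # slope of the line is -gamma
    ln_c = y_mean + gamma * x_mean
    ss_err = sum((y - (ln_c - gamma * x)) ** 2 for x, y in zip(xs, ys))
    r_squared = 1.0 - ss_err / s_yy           # coefficient of determination
    return math.exp(ln_c), gamma, ss_err, r_squared

# Synthetic check (assumed data): an exact power law must be recovered.
synthetic = [0.17 * r ** -1.03 for r in range(1, 501)]
c_fit, gamma_fit, ss_err, r_squared = fit_zipf(synthetic, 30, 400)
print(c_fit, gamma_fit, ss_err, r_squared)
```

For real data one would repeat the fit over candidate windows and keep the widest one for which `ss_err` and `1 - r_squared` stay below 0.05 and 0.005.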



II. DERIVATION OF EQS. (8-10) OF THE MAIN 
TEXT. 

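In (7) of the main text we defined $P_r(t|T)$: the marginal
density for the probability $t$ of the word $w_r$. Using (4, 5)
of the main text, we rewrite (7) of the main text as

$$P_r(t|T) \propto \int_0^{\infty} d\theta_1 \int_0^{\theta_1} d\theta_2 \int_0^{\theta_2} d\theta_3 \ldots \int_0^{\theta_{n-1}} d\theta_n\;P(\theta_1, \ldots, \theta_n)\,\delta(t - \theta_r), \tag{31}$$

where

$$P(\theta) \propto u(\theta_1) \ldots u(\theta_n)\,\delta\Big(\sum_{k=1}^{n}\theta_k - 1\Big), \tag{32}$$

as given by (5) of the main text. Recall that $\theta = (\theta_1, \ldots, \theta_n)$.

In (32) we employ the Fourier representation of the
delta-function,

$$\delta\Big(\sum_{k=1}^{n}\theta_k - 1\Big) = \int_{-i\infty}^{i\infty}\frac{dz}{2\pi i}\,e^{z - z\sum_{k=1}^{n}\theta_k}, \tag{33}$$

put (32) into (31) and then apply integration by parts.

The result reads

$$P_r(t|T) \propto u(t)\int_{-i\infty}^{i\infty}\frac{dz}{2\pi i}\,e^{z}\,\chi_0^{\,n-r}(t, z)\,\chi_1^{\,r-1}(t, z)\,e^{-tz}, \tag{34}$$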

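where

$$\chi_0(t, z) = \int_0^{t} dy\;e^{-zy}\,u(y), \qquad \chi_1(t, z) = \int_t^{\infty} dy\;e^{-zy}\,u(y).$$

The integral in (34) will be worked out via the saddle
point method. But before that we need to fix the scales
of the involved quantities. To this end, make the following
changes of variables

$$\tilde z = z/n, \qquad \tilde t = tn, \qquad \tilde y = yn, \qquad \tilde r = r/n. \tag{35}$$

Then $P_r(t|T)$ reads from (34)

$$P_r(t|T) \propto u(t)\int_{-i\infty}^{i\infty}\frac{d\tilde z}{2\pi i}\,e^{n\phi(\tilde t, \tilde z) - \tilde t\tilde z}, \tag{36}$$

$$\phi(\tilde t, \tilde z) = \tilde z + (1 - \tilde r)\ln\int_0^{\tilde t}\frac{d\tilde y\;e^{-\tilde z\tilde y}}{(c + \tilde y)^2} + \Big(\tilde r - \frac{1}{n}\Big)\ln\int_{\tilde t}^{\infty}\frac{d\tilde y\;e^{-\tilde z\tilde y}}{(c + \tilde y)^2}, \tag{37}$$

where in (37) we already used $u(t) = (n^{-1}c + t)^{-2}$; see
(12) of the main text.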

If $n \gg 1$ and $0 < \tilde r < 1$ is a finite number (neither
close to one, nor to zero), the behavior of $P_r(t|T)$ in various
averages, e.g. $\int dt\,t\,P_r(t|T)$, is determined by the values
$\tilde z = \tilde z_s$ and $\tilde t = \tilde t_s$ that maximize $\phi(\tilde t, \tilde z)$. They are found
from saddle-point equations

$$\partial_{\tilde t}\,\phi(\tilde t_s, \tilde z_s) = \partial_{\tilde z}\,\phi(\tilde t_s, \tilde z_s) = 0. \tag{38}$$

After reworking the two equations (38) we get Eqs. (9, 10)
of the main text.

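Due to (35), $\tilde z_s$ (that is real and positive) and $\tilde t_s$ stay
finite for $n \gg 1$. Hence the integration line over $\tilde z$ in (36)
is shifted to pass through $\tilde z_s$ (the saddle-point method).
Now $\phi(\tilde t, \tilde z)$ is expanded around $\tilde z = \tilde z_s$ and $\tilde t = \tilde t_s$
[first-order terms nullify due to (38)]:

$$\phi(\tilde t, \tilde z) = \phi(\tilde t_s, \tilde z_s) + \tfrac{1}{2}\,\partial_{\tilde t\tilde t}\phi(\tilde t_s, \tilde z_s)\,(\tilde t - \tilde t_s)^2 \tag{39}$$
$$\qquad + \tfrac{1}{2}\,\partial_{\tilde z\tilde z}\phi(\tilde t_s, \tilde z_s)\,(\tilde z - \tilde z_s)^2 \tag{40}$$
$$\qquad + \partial_{\tilde t\tilde z}\phi(\tilde t_s, \tilde z_s)\,(\tilde t - \tilde t_s)(\tilde z - \tilde z_s) + \ldots \tag{41}$$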



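Now only these terms can be retained in the integral over
$\tilde z$. Since this integral goes over the imaginary axis, while
$\tilde z_s$ is real, the integration contour is to be shifted to pass
through $\tilde z_s$. For the convergence of the resulting Gaussian
integral we need $\partial_{\tilde z\tilde z}\phi(\tilde t_s, \tilde z_s) > 0$. Taking this Gaussian
integral leads us to [up to factors that are either constant or
irrelevant for $n \gg 1$]

$$P_r(t|T) \propto \exp\Big[-\frac{n}{2\sigma^2}\,(\tilde t - \tilde t_s)^2\Big], \tag{42}$$

$$\frac{1}{\sigma^2} = \frac{[\partial_{\tilde t\tilde z}\phi(\tilde t_s, \tilde z_s)]^2}{\partial_{\tilde z\tilde z}\phi(\tilde t_s, \tilde z_s)} - \partial_{\tilde t\tilde t}\phi(\tilde t_s, \tilde z_s). \tag{43}$$

Hence $P_r(t|T)$ is approximately Gaussian, with the
standard deviation $O(n^{-1/2})$ much smaller than the average
for $\tilde t_s = O(1)$.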

In working out (43), we shall employ the fact that in (37)
$\tilde z_s = \mu$ is a small parameter; see (13) of the main text.
This produces [up to smaller corrections]

$$\sigma = (c + \tilde t_s)\sqrt{\tilde t_s/c}. \tag{44}$$

Eq. (42) derives (8) of the main text, while (44) accounts
for the estimate of $\sigma$ that was presented after (8) of the
main text.



III. DERIVATION OF THE MARGINAL 
PROBABILITY (EQ. (16) AND POINT 6 OF THE 
MAIN TEXT). 

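The marginal probability $P(t)$ is defined from (32) as

$$P(t) = \int d\theta\;P(\theta)\,\delta(t - \theta_r). \tag{45}$$

Using (32, 33) we obtain from (45)

$$P(t) \propto u(t)\int_{-i\infty}^{i\infty}\frac{d\tilde z}{2\pi i}\,e^{n\phi(t, \tilde z)}, \tag{46}$$

$$\phi(t, \tilde z) = (1 - t)\,\tilde z + \ln\int_0^{\infty} d\tilde y\;e^{-\tilde z\tilde y}\,(c + \tilde y)^{-2}. \tag{47}$$

We use the saddle-point method for (46). This produces
the same saddle-point equation (38) for $\tilde z_s$,

$$\int_0^{\infty} d\tilde y\;\frac{(\tilde y - 1)\,e^{-\tilde z_s\tilde y}}{(c + \tilde y)^2} = 0, \tag{48}$$

provided that we note the dominant range $t \propto 1/n \ll 1$
of $t$. Thus

$$P(\theta) \propto u(\theta)\,e^{-n\theta\tilde z_s}. \tag{49}$$

This validates Eq. (16) of the main text, as well as its
point 6.

Likewise, one can show that the marginal density
$P(\theta_1, \ldots, \theta_m)$ factorizes provided that $m \ll n$:

$$P(\theta_1, \ldots, \theta_m) \propto u(\theta_1)\,e^{-\mu\theta_1 n} \ldots u(\theta_m)\,e^{-\mu\theta_m n}. \tag{50}$$

Eq. (50) can be established more heuristically via the
exact relation $\langle[\sum_{k=1}^{n}\theta_k - 1]^2\rangle = 0$, where $\langle \ldots \rangle$ means averaging
over $P(\theta_1, \ldots, \theta_n)$. This relation predicts, together with
$\langle\theta_k\rangle = \frac{1}{n}$, that $\langle\theta_i\theta_j\rangle - \langle\theta_i\rangle\langle\theta_j\rangle = O(n^{-3})$, hence approximate
factorization.

Using (49) with $u(\theta) = (\frac{c}{n} + \theta)^{-2}$ we note that the
standard deviation $\sqrt{\langle(\theta - \langle\theta\rangle)^2\rangle} = \frac{1}{n}\sqrt{\frac{c}{\tilde z_s} - 1} \simeq \frac{1}{n}\sqrt{\frac{c}{\tilde z_s}}$
is larger than the average $\langle\theta\rangle = \int d\theta\,\theta\,P(\theta) = \frac{1}{n}$, since
$c/\tilde z_s \gg 1$.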


IV. MENTAL LEXICON AND APRIORI 
DENSITY 

This is an expanded version of the corresponding section
of the main text. We explain the choice

$$P(\theta) \propto u(\theta_1) \ldots u(\theta_n)\,\delta\Big(\sum_{k=1}^{n}\theta_k - 1\Big), \tag{51}$$

$$u(\theta) = (c\,n^{-1} + \theta)^{-2}, \tag{52}$$

for the apriori probability density for the probabilities
$\theta = (\theta_1, \ldots, \theta_n)$ of different words $w_1, \ldots, w_n$. To avoid the
awkward term "probability for probability" we shall call
$P(\theta)$ likelihood.

Recall that the marginal likelihood deduced from (51)
reads

$$P(\theta) = (n^{-1}c + \theta)^{-2}\,e^{-\mu\theta n}, \tag{53}$$

where $\mu$ is determined by (12, 16) of the main text.
We shall explain the choice (52) via the features of the
marginal likelihood (53), because it eventually
determines the rank-frequency relation leading to the Zipf's
law.

The numbering of the items 13-15 below coincides with that
in the section Mental lexicon and the apriori density
of the main text. The items 13.2, 13.3, 13.4, 16 and 17
below are additional; they are absent in the main
text.

13.1 Recall that the basic reason for the words to
have random (not fixed) probabilities is that the
text-producing author should be able to compose different
texts, where the same word can have different
frequencies. Hence the likelihood $P(\theta)$ of random probabilities
relates to the prior knowledge (or lexicon) of the
text-generating author on the words. This concept of mental
lexicon (the store of words in the long-term memory,
so that the words are employed on-line for expressing
thoughts via phrases and sentences) is well-established
in psycholinguistics [30]. Though there is no unique
theory of mental lexicon (there is only a diverse set of
competing models [30]), some of its basic features are
well-established experimentally and are employed below
for explaining the choice (51, 52).

13.2 We assume that during the conceptual planning of 
the text, i.e. when deciding on its topic, style and po- 
tential audience, the author already chooses (at least ap- 
proximately) two structural parameters: the number n 
of different words to appear there and the constant c. 
This is why the marginal likelihood (53) depends on the 
parameters c and n. We recall that c (along with n) is 
a structural parameter of the text, because according to 
the point 5 of the main text, c/n separates the Zipfian 
(keywords dominated) range from the hapax legomena 
range (rare words). 

13.3 Note that different words have the same marginal
likelihood (53). Put differently, the likelihood $P(\theta)$
is symmetric with respect to interchanging the words
$w_1, \ldots, w_n$. This feature relates to an experimental fact
that words are stored in the mental lexicon in the same
way [34]. The difference between them (e.g. whether the
word is more familiar to the author, and/or used by
him more frequently) can be relevant during the (later)
phonologization stage of speech/text production [34]; in
this context see also the item 17 below.

Naturally, the above symmetry holds for the apriori
likelihood. The posterior likelihood $P(\theta|T)$ (see (6) of the
main text), the one that is conditioned over the written
text, does not and should not have such a symmetry.

13.4 Note that the marginal likelihood (53) concentrates
at small probabilities $\theta \sim c/n$. The concentration holds
locally, since $P(\theta)$ is peaked at $\theta = 0$ and is approximately
constant for $\theta \ll c/n$, and also globally, i.e. on
the level of the full probability:

$$\Pr[\theta < a] = \int_0^{a} d\theta\,P(\theta) = \int_a^{\infty} d\theta\,P(\theta) = \Pr[\theta > a] \quad \text{for } a \simeq c/n. \tag{54}$$

If $a$ is sufficiently larger (smaller) than $c/n$, the left-hand
side of (54) is larger (smaller) than its right-hand side.
The local and global concentrations are different from 
each other. For example, consider P{6) oc 0~^/^e~^''". 
It displays a local concentration around ^ ~ 0, but (54) 
(global concentration) predicts a ~ -j^^. 
Hence according to (53), apriori (i.e. before the text is 
written) all the (content) words have small probabili- 
ties. This is explained as follows. Since the majority of 
words in the mental lexicon are potential keywords of 
some texts, apriori (i.e. before the text is written) they 
have small probabilities. Indeed, the defining (and oper- 
ationally used) feature of a keyword is that its frequency 
in a given text is much larger than its frequency in a large 
mixture of different texts [33]. Thus the apriori likelihood 
of the probability should be concentrated at small
probabilities $\theta \sim c/n$.

14. Once each word $w_k$ has to have a variable (random)
probability $\theta_k$, there should be a way for the author to
change (increase or decrease) this probability, e.g. when
the author decides that the word $w_k$ is to become the
keyword of the text. The ensuing relation between the
probability vectors $\theta'$ (new) and $\theta$ (old) should be a group,
since the author should be able to come back from $\theta'$ to
$\theta$, e.g. when revising the text.

One can impose two natural restrictions on this group
[31]. These restrictions follow the general idea that the
meaning of $\theta$ as probabilities of certain events is
conserved during the transformation.

First, the words that have strictly zero probability ^fc = 
will stay zero, i.e. = if and only if 9k = 0. 
Second, the probability mixtures are conserved: if 



6> = Ax + (1 - A)r7, < A < 1, 



(55) 



where x = (xii •■•) X") and 77 = (r?i, ri„) are arbitrary 
probability vectors, and where A is a (mixing) parameter, 
then 



0' = Ax' + (1 - A)»7'. 



(56) 



Here primed and non primed probability vectors relate 
to each other via the sought group. 

The only group that (for n > 3) is consistent with the 
above two conditions is [31]: 



Tk > 0, 



(57) 



where Tk are the group parameters. If the author wants 
to increase two times the probability of the word wi , then 

n = 2 and rk>2 = 1. 

Note that (57) becomes the Bayes formula if we relate
$\tau_k$ to a conditional probability [32]. In this alternative
interpretation of (57), the author has to retrieve a word
$w$ having certain specific features (e.g. being a transitive
verb) from the set of words $w_1, \ldots, w_n$ having probabilities
$\theta_1, \ldots, \theta_n$. If we denote by $\Pr(E|w = w_k)$ the conditional
probability that the word $w_k$ displays the needed feature
$E$, we can set in (57) $\tau_k = \Pr(E|w = w_k)$, and (57)
will describe the searching process for the word having
the needed feature $E$.
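
The Bayesian reading of (57) can be illustrated with a toy lexicon. This is only a sketch under stated assumptions: the three words, their probabilities and the feature probabilities $\Pr(E|w = w_k)$ are invented for illustration and do not come from the text.

```python
def posterior(theta, feature_prob):
    """Bayes formula: Pr(w_k | E) = Pr(E | w_k) * theta_k / sum_l Pr(E | w_l) * theta_l."""
    joint = [f * p for f, p in zip(feature_prob, theta)]
    evidence = sum(joint)  # Pr(E), the normalization in (57)
    return [j / evidence for j in joint]


# Hypothetical lexicon and prior word probabilities theta_k.
words = ["eat", "sleep", "apple"]
theta = [0.2, 0.3, 0.5]

# Hypothetical values of tau_k = Pr(E | w = w_k), with E = "is a transitive verb".
pr_feature = [0.9, 0.1, 0.05]

post = posterior(theta, pr_feature)
for word, p_old, p_new in zip(words, theta, post):
    print(f"{word}: {p_old:.3f} -> {p_new:.3f}")
```

Searching for the feature $E$ moves probability towards the words that are likely to display it, which is exactly the reweighting (57) with $\tau_k = \Pr(E|w = w_k)$.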

15. Since $P(\theta)$ is the basic description of the mental
lexicon that enters into our model, and since (57) is an
operation by which the text is ultimately written, it is natural
to suppose that the features of $P(\theta)$ can be explained by
checking its response to (57). It is with a similar purpose
of motivating the prior likelihood that (57) is applied in
Bayesian statistics [31, 32]. There, however, the attention
is focused on the non-informative prior likelihood that
will stay invariant under (57). This is not suitable for our
purpose precisely because we expect that the mental
lexicon, whose organization $P(\theta)$ refers to, will somehow
reflect the basic mechanism (57), i.e. $P(\theta)$ will display
specific changes under (57).

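In interpreting those changes, we adapt (57) to the
probability increase of a single word $w_1$, whose probability the
author decides to increase by $\tau_1 > 1$ times. Thus, (57) is
applied for $\tau_2 = \ldots = \tau_n = 1$:

$$\theta'_1 = \frac{\tau_1\theta_1}{1+(\tau_1-1)\theta_1}, \qquad \theta'_l = \frac{\theta_l}{1+(\tau_1-1)\theta_1} \quad \text{for } l \geq 2. \tag{58}$$

The inverse of transformation (58) reads

$$\theta_1 = \frac{\tau_1^{-1}\theta'_1}{1+(\tau_1^{-1}-1)\theta'_1}, \qquad \theta_l = \frac{\theta'_l}{1+(\tau_1^{-1}-1)\theta'_1}. \tag{59}$$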
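
As a check, the inversion of (58) can be carried out explicitly. The following short worked derivation uses only the symbols of (58); it starts from $\theta'_1 = \tau_1\theta_1/[1+(\tau_1-1)\theta_1]$ and $\theta'_l = \theta_l/[1+(\tau_1-1)\theta_1]$.

```latex
\begin{align*}
\theta'_1\,\bigl[1+(\tau_1-1)\theta_1\bigr] = \tau_1\theta_1
 \;&\Longrightarrow\;
 \theta_1 = \frac{\theta'_1}{\tau_1-(\tau_1-1)\theta'_1}
          = \frac{\tau_1^{-1}\theta'_1}{1+(\tau_1^{-1}-1)\theta'_1},\\[4pt]
1+(\tau_1-1)\theta_1
 = \frac{\tau_1}{\tau_1-(\tau_1-1)\theta'_1}
 = \frac{1}{1+(\tau_1^{-1}-1)\theta'_1}
 \;&\Longrightarrow\;
 \theta_l = \theta'_l\,\bigl[1+(\tau_1-1)\theta_1\bigr]
          = \frac{\theta'_l}{1+(\tau_1^{-1}-1)\theta'_1},
 \quad l \geq 2.
\end{align*}
```

The inverse thus has the same form as (58) with $\tau_1$ replaced by $\tau_1^{-1}$, as expected for a group.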



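In the frequency range we are interested in, $(\tau_1^{-1}-1)\theta'_1$
can be neglected, hence (59) just reduces to the scaling
transformation:

$$\theta_1 = \tau_1^{-1}\theta'_1, \qquad \theta_l = \theta'_l \quad \text{for } l \geq 2. \tag{60}$$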



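The change of the marginal likelihood for $\theta_1$ is deduced
from (53, 60):

$$P'(\theta'_1) = \frac{1}{\tau_1}\,P\!\left(\frac{\theta'_1}{\tau_1}\right) = \frac{1}{\tau_1}\left(\frac{c}{n}+\frac{\theta'_1}{\tau_1}\right)^{-2}. \tag{61}$$

Thus, for the ratio of the new to the old likelihood of the
probability $\theta'_1$ we get

$$\frac{P'(\theta'_1)}{P(\theta'_1)} = \tau_1 > 1 \quad \text{for } \theta'_1 \gg c\tau_1/n, \tag{62}$$

$$\frac{P'(\theta'_1)}{P(\theta'_1)} = \tau_1^{-1} < 1 \quad \text{for } \theta'_1 \ll c\tau_1/n. \tag{63}$$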

The meaning of (62) is that once the author decides
to increase the probability of the word $w_1$ by $\tau_1$ times,
this word will be $\tau_1$ times more likely to be produced with
higher probabilities, and $\tau_1$ times less likely with smaller
probabilities; see (63). This feature is unique to the form
(53) of the marginal likelihood, which by itself is due to
the form (52) of $u(\theta)$. This is the mechanism that ensures
the appearance of the keywords in the Zipfian range.

If $P(\theta)$ is assumed to reflect the organization of the
mental lexicon, then according to (62, 63) this organization is
efficient, because the decision to increase the probability
of $w_1$ translates into increasing the likelihood of larger
values of the probability. The organization is also stable,
because the likelihood at large probabilities increases
by exactly the amount the author planned (not more).
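
The two limits of the likelihood ratio can be verified numerically. The sketch below is an illustration, not part of the derivation: it keeps only the power-law factor $(c/n+\theta)^{-2}$ of the marginal likelihood (an assumption of this sketch), and the values of $c$, $n$ and $\tau_1$ are hypothetical.

```python
def marginal(theta, c, n):
    """Power-law factor of the marginal likelihood, (c/n + theta)**-2 (unnormalized)."""
    return (c / n + theta) ** -2


def transformed(theta_new, tau, c, n):
    """Likelihood after the scaling theta = theta_new / tau, with the Jacobian 1/tau."""
    return marginal(theta_new / tau, c, n) / tau


def ratio(theta_new, tau, c, n):
    """New-to-old likelihood ratio at the same value theta_new."""
    return transformed(theta_new, tau, c, n) / marginal(theta_new, c, n)


c, n, tau = 0.1, 1000, 2.0   # hypothetical illustrative values
high = 1e3 * c * tau / n     # probability far above c*tau/n
low = 1e-3 * c / n           # probability far below c/n

ratio_high = ratio(high, tau, c, n)  # approaches tau
ratio_low = ratio(low, tau, c, n)    # approaches 1/tau
print(ratio_high, ratio_low)
```

The normalization constant of the likelihood cancels in the ratio, which is why the unnormalized form suffices here.
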

16. Above we related the prior likelihood $P(\theta)$ to the
organization of the mental lexicon. Now we would like to
clarify this relation by looking at some alternative forms
of the marginal likelihood, e.g.

$$u(\theta) = (cn^{-1}+\theta)^{-1}, \tag{64}$$

which will produce

$$P(\theta) = (cn^{-1}+\theta)^{-1}\,e^{-\mu n\theta}. \tag{65}$$

Here $\mu$ is determined from

$$\int_0^\infty dy\,\frac{y-1}{c+y}\,e^{-\mu y} = 0, \tag{66}$$

by analogy to (12) of the main text.
It is clear that instead of (62), we now get $P'(\theta'_1)/P(\theta'_1) = 1$,
i.e. the likelihood of large probabilities does not change
at all. This indicates a lack of organization in the
mental lexicon (or at least a very inefficient organization).
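
A quick numerical check of this statement, under the same assumptions as one would make for the ratio (62): only the power-law factor $(c/n+\theta)^{-1}$ of the alternative likelihood is kept, and the values of $c$, $n$ and $\tau_1$ are hypothetical.

```python
def marginal_alt(theta, c, n):
    """Power-law factor of the alternative likelihood, (c/n + theta)**-1 (unnormalized)."""
    return 1.0 / (c / n + theta)


def ratio_alt(theta_new, tau, c, n):
    """New-to-old likelihood ratio under the scaling theta = theta_new / tau."""
    new = marginal_alt(theta_new / tau, c, n) / tau  # includes the Jacobian 1/tau
    old = marginal_alt(theta_new, c, n)
    return new / old


c, n, tau = 0.1, 1000, 2.0       # hypothetical illustrative values
theta_new = 1e3 * c * tau / n    # a probability far above c*tau/n

ratio_large = ratio_alt(theta_new, tau, c, n)
print(ratio_large)  # close to 1: no gain at large probabilities
```
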
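The rank-frequency relation generated by (65) will read,
by analogy to (11) of the main text,

$$\frac{r}{n} = \frac{\displaystyle\int_{f_r n}^{\infty}\frac{dy}{c+y}\,e^{-\mu y}}{\displaystyle\int_{0}^{\infty}\frac{dy}{c+y}\,e^{-\mu y}}. \tag{67}$$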



In the limit of a sufficiently small $c$, the rank-frequency
relation obtained from (67) is exponential,

$$f_r \propto e^{-\alpha r/n}, \qquad \alpha = \ln(1/c), \tag{68}$$



instead of the Zipf's law. According to (68) the majority
of words have negligible frequencies, hence a small group
of high-frequency words dominates the text. Intuitively,
this connects well with the above statement on the lack
of organization.
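
The exponential form can be made plausible by a rough estimate. The sketch below is only a leading-logarithmic argument, not the authors' derivation. It takes the integrands of the rank-frequency relation to be $e^{-\mu y}/(c+y)$, consistent with (65), and assumes $c \ll f_r n \ll 1/\mu$ and that $\ln(1/\mu)$ is small compared with $\ln(1/c)$.

```latex
\begin{align*}
\int_{f_r n}^{\infty}\frac{dy\,e^{-\mu y}}{c+y}
  &\approx \int_{f_r n}^{1/\mu}\frac{dy}{y} = \ln\frac{1}{\mu f_r n},\\[4pt]
\int_{0}^{\infty}\frac{dy\,e^{-\mu y}}{c+y}
  &\approx \int_{c}^{1/\mu}\frac{dy}{y} = \ln\frac{1}{\mu c},\\[4pt]
\frac{r}{n} \approx \frac{\ln[1/(\mu f_r n)]}{\ln[1/(\mu c)]}
  \;&\Longrightarrow\;
  f_r n \approx \frac{1}{\mu}\,e^{-(r/n)\ln[1/(\mu c)]}
        \approx \frac{1}{\mu}\,e^{-\alpha r/n},
  \qquad \alpha = \ln(1/c).
\end{align*}
```

Both integrals are dominated by the region where the integrand behaves as $1/y$, which is what turns the power law of the Zipfian case into an exponential dependence on the rank.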

17. The message of (62, 63) is closely related (but not
identical) to the word-frequency effect well known for
the mental lexicon: more frequently used
words are produced (recalled) more easily [30, 34, 35].
In the context of (62, 63) this implies that the words
whose probability the author decides to increase (e.g.
the keywords) will be more likely produced with higher
probabilities.

Note that there is no contradiction between the message
of (62, 63) and the fact that all the words have the same
marginal a priori likelihood [see (53)]. The latter aspect
refers to the word as emerging from the mental lexicon,
while the former implicitly refers to the initial stages of
writing the text.

The same distinction is well known for the proper
word-frequency effect in speech production, i.e. producing
words from the mental lexicon [34]. According to the
accepted model [34] of this process, during the first stage of
speech production the author conceptualizes his thought
into the abstract form of the word (lemma). This form
reflects the meaning of the word and its syntactic usage,
but is not yet put into syllabic form and pronounced
[34]. The word-frequency effect comes into play during
the second stage, when the word is put into syllabic form
and pronounced, but is absent when the lemma is
activated in the mental lexicon [34]. This is why the
word-frequency effect can even be reversed (less frequent
words are recognized more easily) for those tasks (e.g.
recognition) that include mainly the lemma activation [35].